AITopics | sea language

SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures

Tasawong, Panuthep, Ngui, Jian Gang, Aji, Alham Fikri, Cohn, Trevor, Limkonchotiwat, Peerat

arXiv.org Artificial Intelligence, Dec-8-2025

Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.

large language model, machine learning, natural language, (18 more...)

arXiv:2512.05501

Country:

Asia (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine (0.92)
Law > Criminal Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)


SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Liu, Chaoqun, Aljunied, Mahani, Chen, Guizhen, Chan, Hou Pong, Xu, Weiwen, Rong, Yu, Zhang, Wenxuan

arXiv.org Artificial Intelligence, Nov-4-2025

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages, namely Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

artificial intelligence, large language model, natural language, (17 more...)

arXiv:2511.0167

Country:

Asia > Southeast Asia (0.81)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)


SEA-LION: Southeast Asian Languages in One Network

Ng, Raymond, Nguyen, Thanh Ngan, Huang, Yuli, Tai, Ngee Chia, Leong, Wai Yi, Leong, Wei Qi, Yong, Xianbin, Ngui, Jian Gang, Susanto, Yosephine, Cheng, Nicholas, Rengarajan, Hamsawardhini, Limkonchotiwat, Peerat, Hulagadri, Adithya Venkatadri, Teng, Kok Wai, Tong, Yeo Yeow, Siow, Bryan, Teo, Wei Yi, Lau, Wayne, Tan, Choon Meng, Ong, Brandon, Ong, Zhi Hao, Montalan, Jann Railey, Chan, Adwin, Antonyrex, Sajeban, Lee, Ren, Choa, Esther, Tat-Wee, David Ong, Liu, Bing Jie Darius, Tjhi, William Chandra, Cambria, Erik, Teo, Leslie

arXiv.org Artificial Intelligence, Oct-31-2025

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

computational linguistic, large language model, machine learning, (20 more...)

arXiv:2504.05747

Country:

Asia (1.00)
North America > United States (0.47)
Europe > Austria > Vienna (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


SEA-BED: Southeast Asia Embedding Benchmark

Ponwitayarat, Wuttikorn, Ng, Raymond, Montalan, Jann Railey, Aung, Thura, Ngui, Jian Gang, Susanto, Yosephine, Tjhi, William, Tasawong, Panuthep, Cambria, Erik, Chuangsuwanich, Ekapol, Nutanong, Sarana, Limkonchotiwat, Peerat

arXiv.org Artificial Intelligence, Aug-26-2025

Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.

large language model, machine learning, natural language, (20 more...)

arXiv:2508.12243

Country:

North America > United States (1.00)
Asia > Southeast Asia (0.61)
Asia > Middle East > UAE (0.45)
Asia > Japan > Honshū (0.27)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Education (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)


SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

Susanto, Yosephine, Hulagadri, Adithya Venkatadri, Montalan, Jann Railey, Ngui, Jian Gang, Yong, Xian Bin, Leong, Weiqi, Rengarajan, Hamsawardhini, Limkonchotiwat, Peerat, Mai, Yifan, Tjhi, William Chandra

arXiv.org Artificial Intelligence, Feb-20-2025

With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and authentic evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasizes SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner.

computational linguistic, sea language, sea-helm, (10 more...)

arXiv:2502.14301

Country:

Asia > Timor-Leste (0.14)
Asia > Singapore (0.05)
North America > Canada > Ontario > Toronto (0.04)
(26 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Dou, Longxu, Liu, Qian, Zhou, Fan, Chen, Changyu, Wang, Zili, Jin, Ziqi, Liu, Zichen, Zhu, Tongyao, Du, Cunxiao, Yang, Penghui, Wang, Haonan, Liu, Jiaheng, Zhao, Yongchi, Feng, Xiachong, Mao, Xin, Yeung, Man Tsung, Pipatanakul, Kunat, Koto, Fajri, Thu, Min Si, Kydlíček, Hynek, Liu, Zeyi, Lin, Qunshu, Sripaisarnmongkol, Sittipong, Sae-Khow, Kridtaphad, Thongchim, Nirattisai, Konkaew, Taechawat, Borijindargoon, Narong, Dao, Anh, Maneegard, Matichon, Artkaew, Phakphum, Yong, Zheng-Xin, Nguyen, Quan, Phatthiyaphaibun, Wannaphong, Tran, Hoang H., Zhang, Mike, Chen, Shiqi, Pang, Tianyu, Du, Chao, Wan, Xinyi, Lu, Wei, Lin, Min

arXiv.org Artificial Intelligence, Feb-18-2025

Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.

large language model, machine learning, qwen2, (19 more...)

arXiv:2502.12982

Country:

North America > United States (0.45)
Asia > East Asia (0.40)
Asia > Indonesia (0.28)
North America > Mexico (0.27)

Genre: Research Report > New Finding (0.92)

Industry: Education > Educational Setting (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)


SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

Liu, Chaoqun, Zhang, Wenxuan, Ying, Jiahao, Aljunied, Mahani, Luu, Anh Tuan, Bing, Lidong

arXiv.org Artificial Intelligence, Feb-10-2025

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.

large language model, machine learning, seabench, (20 more...)

arXiv:2502.06298

Country:

Asia > Southeast Asia (0.40)
Asia > Singapore (0.05)
North America > United States > Hawaii (0.04)
(6 more...)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)


SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

Guo, Jia, Dou, Longxu, Zeng, Guangtao, Kok, Stanley, Lu, Wei, Liu, Qian

arXiv.org Artificial Intelligence, Dec-2-2024

In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian Languages (SEA). SailCompass encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibrations to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.

benchmark, dataset, language model, (16 more...)

arXiv:2412.01186

Country:

Asia > Southeast Asia (0.05)
Asia > Singapore (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(18 more...)

Genre: Research Report (0.50)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Zhang, Wenxuan, Chan, Hou Pong, Zhao, Yiran, Aljunied, Mahani, Wang, Jianyu, Liu, Chaoqun, Deng, Yue, Hu, Zhiqiang, Xu, Weiwen, Chia, Yew Ken, Li, Xin, Bing, Lidong

arXiv.org Artificial Intelligence, Jul-28-2024

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

dataset, language model, zhang, (14 more...)

arXiv:2407.19672

Country:

Asia > Southeast Asia (0.05)
Asia > Singapore (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)


SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, Cahyawijaya, Samuel

arXiv.org Artificial Intelligence, Jul-8-2024

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.

computational linguistic, dataset, sea language, (14 more...)

arXiv:2406.10118

Country:

Asia > Southeast Asia (0.24)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Laos (0.06)
(59 more...)

Genre: Research Report (0.81)

Industry:

Education (0.68)
Information Technology (0.67)
Energy (0.45)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(6 more...)
